To perform online iNMF, we need to install the online branch. Please see the instruction below.
library(devtools)
install_github("welch-lab/liger", ref = "online")
We first create a liger object by passing the filenames of HDF5 files containing the raw count data. The data can be downloaded here. Liger assumes by default that the HDF5 files are formatted by the 10X CellRanger pipeline. Large datasets are often generated over multiple 10X runs (for example, multiple biological replicates). In such cases it may be necessary to merge the HDF5 files from each run into a single HDF5 file. We provide the mergeH5 function for this purpose (see below for details).
library(liger)
pbmcs = createLiger(list(stim = "pbmcs_stim.h5", ctrl = "pbmcs_ctrl.h5"))
We then perform the normalization, gene selection, and gene scaling in an online fashion, reading the data from disk in small batches.
pbmcs = normalize(pbmcs)
pbmcs = selectGenes(pbmcs, var.thresh = 0.2, do.plot = F)
pbmcs = scaleNotCenter(pbmcs)
Now we can use online iNMF to factorize the data, again using only minibatches that we read from the HDF5 files on demand (default mini-batch size = 5000).
pbmcs = online_iNMF(pbmcs, k = 20, max.epochs = 5)
After performing the factorization, we can perform quantile normalization to align the datasets.
pbmcs = quantile_norm(pbmcs)
We can also visualize the cell factor loadings in two dimensions using t-SNE or UMAP.
pbmcs = runUMAP(pbmcs)
plotByDatasetAndCluster(pbmcs, axis.labels = c("UMAP1","UMAP2"))
Let’s first evaluate the quality of data alignment. The alignment score ranges from 0 (no alignment) to 1 (perfect alignment).
calcAlignment(pbmcs)
## [1] 0.9149238
With HDF5 files as input, we need to sample the raw, normalized, or scaled data from the full dataset on disk and load them in memory. Some plotting functions and downstream analyses are designed to operate on a subset of cells sampled from the full dataset. This enables rapid analysis using limited memory. The readSubset function allows either uniform random sampling or sampling balanced by cluster. Here we extract the normalized count data of 5000 sampled cells.
pbmcs = readSubset(pbmcs, slot.use = "norm.data", max.cells = 5000)
Using the sampled data stored in memory, we can now compare clusters or datasets (within each cluster) to identify differentially expressed genes. The runWilcoxon function performs differential expression analysis by performing an in-memory Wilcoxon rank-sum test on this subset. Thus, users can still analyze large datasets with a fixed amount of memory.
de_genes = runWilcoxon(pbmcs, compare.method = "datasets")
Here we show the top 10 genes in cluster 1 whose expression level significantly differ between two dataset.
de_genes = de_genes[order(de_genes$padj), ]
head(de_genes[de_genes$group == "1-stim",], n = 10)
## feature group avgExpr logFC statistic auc pval
## 14061 ISG15 1-stim -5.430433 13.820208 121239.0 0.9898839 6.419043e-114
## 14324 IFI6 1-stim -6.843238 14.763583 118624.5 0.9685372 2.942152e-108
## 23989 ISG20 1-stim -5.779989 9.286306 117816.5 0.9619401 4.421875e-99
## 21242 IFIT3 1-stim -8.320794 14.539736 114625.0 0.9358824 1.133535e-98
## 21244 IFIT1 1-stim -9.094809 13.669177 112456.0 0.9181731 2.367054e-92
## 27532 MX1 1-stim -8.935794 12.875135 111536.5 0.9106656 5.884017e-87
## 20364 LY6E 1-stim -8.721669 12.810795 111435.0 0.9098369 9.506436e-86
## 23754 B2M 1-stim -3.188800 0.461932 108547.5 0.8862612 3.907049e-69
## 21241 IFIT2 1-stim -11.928663 10.782880 102299.0 0.8352439 3.361002e-66
## 14634 IFI44L 1-stim -12.345872 9.720498 98779.5 0.8065081 3.120570e-55
## padj pct_in pct_out
## 14061 9.020681e-110 100 100
## 14324 2.067303e-104 100 100
## 23989 2.071354e-95 100 100
## 21242 3.982393e-95 100 100
## 21244 6.652841e-89 100 100
## 27532 1.378135e-83 100 100
## 20364 1.908485e-82 100 100
## 23754 6.863219e-66 100 100
## 21241 5.248019e-63 100 100
## 14634 4.385337e-52 100 100
We can show UMAP coordinates of sampled cells by their loadings on each factor (Factor 1 as an example). Underneath it displays the most highly loading shared and dataset-specific genes, with the size of the marker indicating the magnitude of the loading.
p_wordClouds = plotWordClouds(pbmcs, num.genes = 5, return.plots = T)
p_wordClouds[[1]]
We can generate plots of dimensional reduction coordinates colored by expression of specified gene. The first two UMAP dimensions and gene ISG15 (identified by Wilcoxon test in the previous step) is used as an example here.
plotGene(pbmcs, "ISG15", return.plots = F)
Furthermore, we can make violin plots of expression of specified gene for each dataset (ISG15 as an example).
plotGeneViolin(pbmcs, "ISG15", return.plots = F)
The online algorithm can be implemented on datasets loaded in memory as well. The same analysis is performed on the PBMCs, shown below.
stim = readRDS("pbmcs_stim.RDS")
ctrl = readRDS("pbmcs_ctrl.RDS")
pbmcs_mem = createLiger(list(stim = stim, ctrl = ctrl), remove.missing = F)
pbmcs_mem = normalize(pbmcs_mem)
pbmcs_mem = selectGenes(pbmcs_mem, var.thresh = 0.2, do.plot = F)
pbmcs_mem = scaleNotCenter(pbmcs_mem)
pbmcs_mem = online_iNMF(pbmcs_mem, k = 20, max.epochs = 5)
pbmcs_mem = quantile_norm(pbmcs_mem)
pbmcs_mem = runUMAP(pbmcs_mem)
plotByDatasetAndCluster(pbmcs_mem, axis.labels = c("UMAP1","UMAP2"))
As mentioned above, it is sometimes necessary to merge multiple HDF5 files (such as multiple 10X runs from the same tissue or condition) into a single file. We provide the mergeH5 function for this purpose. The function takes as input a list of filenames to merge (file.list), a vector of sample names that are prepended to the cell barcodes (library.names), and the name of the merged HDF5 file. The function requires that all files to be merged include exactly the same set of genes. For example, we can merge the cells and nuclei datasets used in the examples below (note that merging these particular two datasets is not something you would like to do, and is purely to demonstrate the mergeH5 function).
mergeH5(file.list = list("allen_smarter_cells.h5", "allen_smarter_nuclei.h5"),
library.names = c("cells","nuclei"),
new.filename = "cells_nuclei")
##Scenario 2: iterative refinement by incorporating new datasets We can also perform online iNMF with continually arriving datasets.
MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp = normalize(MOp)
MOp = selectGenes(MOp, var.thresh = 2)
MOp.vargenes = MOp@var.genes
MOp = scaleNotCenter(MOp)
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))
MOp2 = createLiger(list(nuclei = "allen_smarter_nuclei.h5"))
MOp2 = normalize(MOp2)
MOp2@var.genes = MOp@var.genes
MOp2 = scaleNotCenter(MOp2)
MOp = online_iNMF(MOp, X_new = list(nuclei = MOp2), k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))
##Scenario 3: projecting new datasets
MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp@var.genes = MOp.vargenes
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))
MOp = online_iNMF(MOp, X_new = list(nuclei = MOp2), k = 40, project = TRUE)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))